class: center, middle, inverse, title-slide .title[ # APEC8211: Recitation 5 ] .author[ ### Shunkei Kakimoto ] --- class: middle <style type="text/css"> /* center figure */ .center2 { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } .small-code .remark-code{ font-size: 50% } .medium-code .remark-code{ font-size: 70% } .xlarge { font-size: 150% } .large { font-size: 130% } .medium { font-size: 80% } .small { font-size: 70% } .xsmall { font-size: 50% } .remark-slide-number { display: none; } .one-page-font { font-size: 14px; } .remark-slide-content.hljs-github h1 { margin-top: 5px; margin-bottom: 25px; } .remark-slide-content.hljs-github { padding-top: 10px; padding-left: 30px; padding-right: 30px; } .panel-tabs { /* color: #062A00; */ color: #841F27; margin-top: 0px; margin-bottom: 0px; margin-left: 0px; padding-bottom: 0px; } .panel-tab { margin-top: 0px; margin-bottom: 0px; margin-left: 3px; margin-right: 3px; padding-top: 0px; padding-bottom: 0px; } .panelset .panel-tabs .panel-tab { min-height: 40px; } .remark-slide th { border-bottom: 1px solid #ddd; } .remark-slide thead { border-bottom: 0px; } .gt_footnote { padding: 2px; } .remark-slide table { border-collapse: collapse; } .remark-slide tbody { border-bottom: 2px solid #666; } .important { background-color: lightpink; border: 2px solid blue; font-weight: bold; } .remark-code { display: block; overflow-x: auto; padding: .5em; background: #ffe7e7; } .remark-code, .remark-inline-code { font-family: 'Source Code Pro', 'Lucida Console', Monaco, monospace; font-size: 90%; } .hljs-github .hljs { background: #f2f2fd; } .remark-inline-code { padding-top: 0px; padding-bottom: 0px; background-color: #e6e6e6; } .r.hljs.remark-code.remark-inline-code{ font-size: 0.9em } .left-full { width: 80%; float: left; } .left-code { width: 38%; height: 92%; float: left; } .right-plot { width: 60%; float: right; padding-left: 1%; } .left6 { width: 60%; 
height: 92%; float: left; } .left5 { width: 49%; /* height: 92%; */ float: left; } .right5 { width: 49%; float: right; padding-left: 1%; } .right4 { width: 39%; float: right; padding-left: 1%; } .left3 { width: 29%; height: 92%; float: left; } .right7 { width: 69%; float: right; padding-left: 1%; } .left4 { width: 38%; float: left; } .right6 { width: 60%; float: right; padding-left: 1%; } ul li{ margin: 7px; } ul, li{ margin-left: 15px; padding-left: 0px; } ol li{ margin: 7px; } ol, li{ margin-left: 15px; padding-left: 0px; } </style> <style type="text/css"> .content-box { box-sizing: border-box; background-color: #e2e2e2; } .content-box-blue, .content-box-gray, .content-box-grey, .content-box-army, .content-box-green, .content-box-purple, .content-box-red, .content-box-yellow { box-sizing: border-box; border-radius: 5px; margin: 0 0 10px; overflow: hidden; padding: 0px 5px 0px 5px; width: 100%; } .content-box-blue { background-color: #F0F8FF; } .content-box-gray { background-color: #e2e2e2; } .content-box-grey { background-color: #F5F5F5; } .content-box-army { background-color: #737a36; } .content-box-green { background-color: #d9edc2; } .content-box-purple { background-color: #e2e2f9; } .content-box-red { background-color: #ffcccc; } .content-box-yellow { background-color: #fef5c4; } .content-box-blue .remark-inline-code, .content-box-gray .remark-inline-code, .content-box-grey .remark-inline-code, .content-box-army .remark-inline-code, .content-box-green .remark-inline-code, .content-box-purple .remark-inline-code, .content-box-red .remark-inline-code, .content-box-yellow .remark-inline-code { background: none; } .full-width { display: flex; width: 100%; flex: 1 1 auto; } </style> <style type="text/css"> blockquote, .blockquote { display: block; margin-top: 0.1em; margin-bottom: 0.2em; margin-left: 5px; margin-right: 5px; border-left: solid 10px #0148A4; border-top: solid 2px #0148A4; border-bottom: solid 2px #0148A4; 
border-right: solid 2px #0148A4; box-shadow: 0 0 6px rgba(0,0,0,0.5); /* background-color: #e64626; */ color: #e64626; padding: 0.5em; -moz-border-radius: 5px; -webkit-border-radius: 5px; } .blockquote p { margin-top: 0px; margin-bottom: 5px; } .blockquote > h1:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h2:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h3:first-of-type { margin-top: 0px; margin-bottom: 5px; } .blockquote > h4:first-of-type { margin-top: 0px; margin-bottom: 5px; } .text-shadow { text-shadow: 0 0 4px #424242; } </style> <style type="text/css"> /****************** * Slide scrolling * (non-functional) * not sure if it is a good idea anyway slides > slide { overflow: scroll; padding: 5px 40px; } .scrollable-slide .remark-slide { height: 400px; overflow: scroll !important; } ******************/ .scroll-box-8 { height:8em; overflow-y: scroll; } .scroll-box-10 { height:10em; overflow-y: scroll; } .scroll-box-12 { height:12em; overflow-y: scroll; } .scroll-box-14 { height:14em; overflow-y: scroll; } .scroll-box-16 { height:16em; overflow-y: scroll; } .scroll-box-18 { height:18em; overflow-y: scroll; } .scroll-box-20 { height:20em; overflow-y: scroll; } .scroll-box-24 { height:24em; overflow-y: scroll; } .scroll-box-30 { height:30em; overflow-y: scroll; } .scroll-output { height: 90%; overflow-y: scroll; } </style> .content-box-green[**A useful tip:**] pressing the "o" key gives you a panel overview of the slides --- class: middle # Outline 1. Quick overview 2. Sampling distribution 3. Variance of the Least Squares Estimator ??? 
--- class: inverse, center, middle name: intro # Quick overview: where we are <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- class: middle ## The big picture <p style="text-align: center;">.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-2.ph2.mt5[<span style="color:blue">CEF</span>: `\(Y=E[Y|X]+\varepsilon\)`] </p> <p style="text-align: center;">↓</p> .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-2.ph2.mt2[ Approximate the CEF with the <span style="color:blue">*linear projection model*</span> (population regression function) `$$Y=X^{\prime}\beta + e$$` Then, the <span style="color:blue">*linear projection coefficient*</span> is `$$\beta = \arg \min_{b} E[(Y-X^{\prime}b)^2] = E[XX^{\prime}]^{-1}E[XY]$$` ] <p style="text-align: center;">↓</p> .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-2.ph2.mt2[ <span style="color:blue">With a sample</span>, estimate the linear projection coefficient `\(\beta\)` with the <span style="color:blue">*least squares estimator (OLS)*</span> `$$\hat{\beta} = \arg \min_{\beta} \frac{1}{n} \sum_{i=1}^{n}(Y_i - X_{i}^{\prime}\beta)^2 = \left(\sum_{i=1}^{n} X_{i}X_{i}^{\prime} \right)^{-1}\left(\sum_{i=1}^{n} X_{i}Y_{i} \right)=(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{Y})$$` ] ??? + Okay, it has been five weeks since the fall semester started. Only five weeks have passed, but we have learned a lot of econometric theory, right? + Okay, we started with the conditional expectation function. --- class: middle <img src="data:image/png;base64,#sampring.png" width="100%" style="display: block; margin: auto;" /> + We will learn lots about inference (e.g., asymptotic theory, resampling methods) in APEC8212! 
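--- class: middle

The closed-form expression for `\(\hat{\beta}\)` is easy to verify numerically. Below is a minimal sketch with simulated data (the regressors `x1`, `x2`, the coefficients, and the sample size are made up for illustration): computing `\((\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{Y}\)` by hand reproduces the coefficients reported by `lm()`.

```r
# Verify the matrix formula for the least squares estimator:
# beta_hat = (X'X)^{-1} X'Y should match coef(lm(...)).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y

all.equal(as.numeric(beta_hat), unname(coef(lm(y ~ x1 + x2))))  # TRUE
```

(`solve(A, b)` is preferred over explicitly inverting `t(X) %*% X`, since it is numerically more stable.)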
--- class: inverse, center, middle name: intro # Sampling distribution <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- class: middle .content-box-red[**Sampling distribution**] + <span style="color:red">The sampling distribution of an estimator is the probability distribution of the estimator.</span> + <span style="color:blue">Every statistic (e.g., sample mean, median, OLS estimator, t-statistic, standard error ...) has a sampling distribution.</span> + <span style="color:blue">Properties of an estimator (such as unbiasedness, consistency, and variance) are all about the sampling distribution.</span> --- class: middle # Why does the sampling distribution matter? Consider the following model: `$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + e$$` Suppose that you got `\(\hat{\beta}_1>0\)`. <br> .content-box-green[**Question**] + Does the result `\(\hat{\beta}_1>0\)` alone provide evidence that `\(educ\)` has a positive impact on `\(wage\)`? ??? + This is related to question 2-(b) in assignment 3. In that problem, the expected value of the sample mean equals the population mean, but a specific realization of the sample mean does not. --- ## Simulation + Suppose that the size of the population is `\(10000\)`, and the size of your sample is `\(100\)`. 
.medium-code[ .panelset[ .panel[.panel-name[Code] ```r set.seed(2346) # population size N <- 10^4 # sample size n <- 100 # === Population data === # id <- 1:N educ <- rnorm(N, mean = 0, sd = 12) exper <- runif(N, min = 0, max = 20) data <- data.table( id = id, educ = educ, exper = exper, lwage = 1 + 0*educ + 3*exper + rnorm(N, mean = 20, sd = 6) ) # === Estimate the coefficient of "educ" with 10 different samples === # for(i in 1:10){ sample_id <- sample(1:N, size=n, replace=F) sample <- data[id %in% sample_id,] reg <- lm(lwage ~ educ + exper, sample) print(paste0("sample ", i, ": coefficient of educ is ", coef(reg)[["educ"]])) } ``` ] .panel[.panel-name[Result] ``` ## [1] "sample 1: coefficient of educ is 0.0173603826044669" ## [1] "sample 2: coefficient of educ is 0.0375523820705352" ## [1] "sample 3: coefficient of educ is 0.0621289369398736" ## [1] "sample 4: coefficient of educ is 0.138616905592951" ## [1] "sample 5: coefficient of educ is -0.00109708126591545" ## [1] "sample 6: coefficient of educ is -0.0526003802635388" ## [1] "sample 7: coefficient of educ is -0.0405833137506325" ## [1] "sample 8: coefficient of educ is -0.0680550707052675" ## [1] "sample 9: coefficient of educ is -0.0689725359432311" ## [1] "sample 10: coefficient of educ is 0.0202357134709651" ``` ] ] ] .small[By the way, this is not how you would usually structure a Monte Carlo simulation. Don't imitate!] ??? + Look at the results from the 4th sample. The coefficient estimate of education is about `\(0.14\)`, meaning one more year of education increases wages by roughly 14% on average. The size of the estimate is pretty big. --- class: middle Because of <span style="color:red">sampling variability</span>, you get different realizations of `\(\hat{\beta}_1\)`. 
This means that even if `\(\beta_1=0\)` in the population, it is possible to get an estimate `\(\hat{\beta}_1\)` that is far from `\(0\)`. --- ## Sampling distribution To simulate the sampling distribution of `\(\hat{\beta}_1\)`, repeat the estimation of `\(\beta_1\)` many, many times. .medium-code[ .panelset[ .panel[.panel-name[Let's try it!] Goal: Simulate the sampling distribution of `\(\hat{\beta}_1\)`. (1) Sample 100 random observations. (2) Run a regression with the sample. (3) Save the <span style="color:blue">coefficient estimate</span> and <span style="color:blue">standard error</span> of `\(educ\)`. (4) Repeat (1)-(3) B times. (5) Plot the sampling distribution of `\(\hat{\beta}_1\)` (e.g., `hist(res_data$coef, breaks=50)`). ```r # the number of iterations B = 2000 # create an object to store the results res_dt <- # Simulation for (i in 1:B){ # 1. select 100 observations randomly from data to make a sample # 2. run a regression with the sample data # 3. store the results } ``` ] .panel[.panel-name[Code] ```r # the number of iterations B = 2000 res_data <- data.table( coef = rep(0, B), se = rep(0, B) ) for(i in 1:B){ sample_id <- sample(1:N, size=n, replace=F) sample <- data[id %in% sample_id,] reg <- lm(lwage ~ educ + exper, sample) res_data[i, "coef"] <- coef(reg)[["educ"]] res_data[i, "se"] <- sqrt(diag(vcov(reg))[["educ"]]) } # === Visualization === # vis_sampling <- ggplot(res_data) + geom_histogram(aes(x=coef), alpha = 0.8)+ geom_vline(aes(xintercept=mean(coef)), color = "red")+ annotate("text", x = mean(res_data$coef), y = -10, label = expression("Mean of " ~ hat(beta[1]) ~ " is 0"), size = 3)+ theme_bw() + theme(legend.position="none") ``` ] .panel[.panel-name[Result] <img src="data:image/png;base64,#recitation5_slides_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" /> ] ] ] ??? 
+ Although there are some large estimates in the left and right tails, the sampling distribution is centered at zero; in other words, most of the time you get coefficient estimates very close to zero. + Based on the sampling distribution, we conduct hypothesis testing. We will learn a lot about hypothesis testing in APEC8212. --- class: middle <span style="color:red">The standard error is an *estimator/estimate* of the standard deviation of the sampling distribution.</span> .content-box-green[**Let's check this**] 1. Calculate the standard deviation of the sampling distribution. 2. Calculate the mean of the standard errors you estimated. 3. Compare them. .medium-code[ .panelset[ .panel[.panel-name[Let's check it!] ```r # sd of the sampling distribution # mean of the se ``` ] .panel[.panel-name[Code and Results] ```r # sd of the sampling distribution sd(res_data$coef) ``` ``` ## [1] 0.04068293 ``` ```r # mean of the se mean(res_data$se) ``` ``` ## [1] 0.0422298 ``` ]] ] --- class: middle + **The problem is that we never know the actual sampling distribution of an estimator** * This calls for the CLT or the Delta method * or for resampling methods (APEC8212) PSE, p. 134 > The goal of an estimator `\(\hat{\theta}\)` is to learn about the parameter `\(\theta\)`. To make accurate inferences and to measure the accuracy of our measurements, we need to know something about its sampling distribution. Therefore a considerable effort in statistical theory is devoted to understanding the sampling distribution of estimators. --- # Unbiasedness and Consistency .content-box-red[**Unbiasedness**] `$$E[\hat{\theta}_n]=\theta$$` **Verbally:** The central tendency of the sampling distribution `\((E[\hat{\theta}_n])\)` corresponds to the population parameter `\(\theta\)`. 
+ The bias of `\(\hat{\theta}_n\)` is `\(E[\hat{\theta}_n] - \theta\)` .content-box-red[**Consistency**] `$$\text{For all } \delta > 0, \quad Pr(|\hat{\theta}_n-\theta| \leq \delta) \rightarrow 1 \quad \text{as } n \rightarrow \infty$$` <p style="text-align: center;">or</p> `$$\hat{\theta}_n \xrightarrow{p} \theta \quad \text{as } n \rightarrow \infty$$` **Verbally:** <span style="color:blue">As `\(n \rightarrow \infty\)`, </span>`\(\hat{\theta}_n\)` converges in probability to the population parameter `\(\theta\)`; its sampling distribution collapses around `\(\theta\)`. --- # Exercise <span style="color:red">Generally, unbiasedness does not imply consistency, and vice versa.</span> <br> Let's think about a couple of unbiased estimators for the population mean. + Suppose that we have an i.i.d. sample `\(\{X_i\}_{i=1}^{n}\)` drawn from a population with mean `\(E[X]=\mu\)` and variance `\(Var[X] = \sigma^2\)`. Candidates: + Estimator 1: Sample mean: `\(\overline{X}_n=\frac{1}{n}\sum_{i=1}^{n} X_i\)` + Estimator 2: Use only the first observation: `\(X_1\)` .content-box-green[**Question**] + Show that `\(X_1\)` is an unbiased estimator of the population mean `\(\mu\)`. + Is `\(X_1\)` a consistent estimator of the population mean `\(\mu\)`? --- # Monte Carlo simulations .medium-code[ .panelset[ .panel[.panel-name[Let's try it!] Let's write code to see the sampling distributions of `\(\overline{X}_n\)` and `\(X_1\)`. .content-box-green[**Steps**] (1) Generate n=1000 random numbers (e.g., `rnorm(n, mean = 2, sd = 1)`, `rchisq(n, df = 4)`) (2) Obtain `\(\overline{X}_n\)` and `\(X_1\)`. (3) Save the result. (4) Repeat (1)-(3) B=1000 times. (5) Plot the sampling distributions of `\(\overline{X}_n\)` and `\(X_1\)`. (6) **For an extra challenge**, produce sampling distributions of `\(\overline{X}_n\)` and `\(X_1\)` with `\(n = 100, 500, 1000\)` and see how the shapes of the distributions change. 
] .panel[.panel-name[Code 1: n=1000] ```r # the number of iterations B = 1000 n = 1000 mu = 10 res_data <- data.table( x_bar = rep(0, B), x_single = rep(0, B) ) for(i in 1:B){ data <- rchisq(n, df=mu) x_bar <- mean(data) x_single <- data[1] res_data[i, "x_bar"] <- x_bar res_data[i, "x_single"] <- x_single } # === Visualization === # vis_sampling <- ggplot(res_data) + geom_histogram(aes(x=x_bar, fill="x_bar"), bins = 50, alpha = 0.5)+ geom_histogram(aes(x=x_single, fill="x_single"), bins = 50, alpha = 0.5)+ labs(title="Sample size 1000")+ theme_bw()+ # modify fill color scale_fill_manual(values = c("x_bar"="red", "x_single"="blue"))+ # modify x-label and legend name labs(x = expression(hat(beta)), fill = "Estimator") ``` ] .panel[.panel-name[Result 1] ```r vis_sampling ``` <img src="data:image/png;base64,#recitation5_slides_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code 2: n=100,500,1000] .small[ ```r # the number of iterations B <- 1000 ls_n <- c(100, 500, 1000) mu <- 10 # make a function get_mean <- function(sample_size){ storage <- data.table( sample_size = rep(sample_size, B), x_bar = rep(0, B), x_single = rep(0, B) ) for(i in 1:B){ data <- rnorm(sample_size, mean=mu, sd=5) storage[i, "x_bar"] <- mean(data) storage[i, "x_single"] <- data[1] } return(storage) } # run simulations ls_res_dt <- lapply(ls_n, function(x) get_mean(sample_size=x)) res_dt <- rbindlist(ls_res_dt) # === Visualization === # vis_sampling <- ggplot(res_dt) + geom_histogram(aes(x=x_bar, fill = "x_bar"), bins = 50, alpha = 0.5)+ geom_histogram(aes(x=x_single, fill = "x_single"), bins = 50, alpha = 0.5)+ facet_wrap(~ sample_size, nrow=1)+ theme_bw()+ # modify fill color scale_fill_manual(values = c("x_bar"="red", "x_single"="blue"))+ # modify x-label and legend name labs(x = expression(hat(beta)), fill = "Estimator") ``` ] 
] .panel[.panel-name[Result 2] ```r vis_sampling ``` <img src="data:image/png;base64,#recitation5_slides_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" /> ] ] ] ] --- class: inverse, center, middle name: intro # Variance of the Least Squares Estimator <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- class: middle .content-box-red[**Important!**] `$$\mathbf{V}_{\hat{\beta}} = Var[\hat{\beta}|\mathbf{X}]=\mathbf{(X^{\prime}X)^{-1}X^{\prime}DX(X^{\prime}X)^{-1}}$$` where `\(\mathbf{D}=Var[\mathbf{e|X}]=E[\mathbf{ee^{\prime}|X}]\)` .content-box-green[**Question**] + Let's derive this formula.
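--- class: middle

As a sanity check (a minimal sketch with simulated data; the variable names are made up for illustration): under homoskedasticity `\(\mathbf{D} = \sigma^2 \mathbf{I}_n\)`, and the sandwich formula collapses to `\(\sigma^2(\mathbf{X}^{\prime}\mathbf{X})^{-1}\)`, which is exactly what `vcov()` reports for an `lm` fit once `\(\sigma^2\)` is estimated by `\(s^2 = \hat{e}^{\prime}\hat{e}/(n-k)\)`.

```r
# Plugging the homoskedastic D = s^2 * I_n into the sandwich formula
# (X'X)^{-1} X'DX (X'X)^{-1} recovers vcov() of an lm fit.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
reg <- lm(y ~ x)

X  <- model.matrix(reg)
s2 <- sum(resid(reg)^2) / (n - ncol(X))  # s^2 = e'e / (n - k)
D  <- s2 * diag(n)                       # homoskedastic Var[e|X]

XtX_inv <- solve(t(X) %*% X)
V_sandwich <- XtX_inv %*% t(X) %*% D %*% X %*% XtX_inv

all.equal(unname(V_sandwich), unname(vcov(reg)))  # TRUE
```

Under heteroskedasticity, `\(\mathbf{D}\)` is a general diagonal matrix and the sandwich no longer simplifies; that case is exactly why the full formula matters.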